January 2019 - Cesky Krumlov

But first a beautiful chair

Goals of this short course in R

  • To help you perform your research through instruction in the core components of
    • data collection and organization
    • manipulation and analysis
    • interpretation and presentation
  • Provide a broad coverage of the core components of modern biological statistics
  • Provide you with the computational tools necessary to carry out your work - namely R and affiliated tools

Why do we need statistics?

  • We almost never know the world perfectly, but still want to draw conclusions or make decisions
  • We may need to
    • estimate underlying parameters from samples of data
    • test hypotheses using data
    • summarize and/or visualize large amounts of data
  • There are well known mathematical rules that help us
  • Statistics can be done by hand, but computers let us do most of the mathematics quickly

A biological example to get us started

Say you perform an experiment on two different strains of stickleback fish, one from an ocean population (RS) and one from a freshwater lake (BP), by making them microbe-free. Microbes in the gut are known to interact with the gut epithelium in ways that lead to proper maturation of the immune system.

A biological example to get us started

EXPERIMENTAL SETUP - You carry out an experiment by treating multiple fish from each strain so that some of them have a conventional microbiota, and some are inoculated with only one bacterial species. You then measure the levels of gene expression in the stickleback gut using RNA-seq. You suspect that the sex of the fish might be important so you track it too.

A biological example to get us started

  • GETTING THE DATA READY TO ANALYZE
    • How should the data set be organized to best analyze it?
    • What are the key properties of the variables?
    • Why does that matter for learning R?
    • Why does that matter for performing statistical analyses?

NOTE - Include a slide that highlights a data table for this experiment

Data set rules of thumb (aka Tidy Data)

  • Store a copy of data in nonproprietary formats
  • Keep an uncorrected copy of the raw data file when doing analyses
  • Maintain effective metadata about the data
  • When you add observations to a database, add rows
  • When you add variables to a database, add columns
  • A column of data should contain only one data type
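A minimal base-R sketch of the row/column rules above (the fish values here are invented purely for illustration):

```r
# Tidy layout: each row is an observation, each column a single variable
fish <- data.frame(
  id        = c("F1", "F2"),
  strain    = c("RS", "BP"),
  treatment = c("conventional", "mono"),
  sex       = c("M", "F")
)

# Adding an observation adds a row
fish <- rbind(fish, data.frame(id = "F3", strain = "RS",
                               treatment = "mono", sex = "F"))

# Adding a variable adds a column (one data type per column)
fish$weight_g <- c(1.2, 1.4, 1.1)

print(fish)
```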

Why use R?

  • Good general scripting tool for statistics and mathematics
  • Powerful and flexible and free
  • Runs on all computer platforms
  • New enhancements coming out all the time
  • Superb data management & graphics capabilities
  • Reproducibility - can keep your scripts to see exactly what was done
  • You can write your own functions
  • Lots of online help available
  • Can use a nice GUI front end such as RStudio
  • Can embed your R analyses in dynamic, polished files using R markdown
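As a quick illustration of writing your own function (this standard-error example is my own sketch, not from the slides):

```r
# A user-defined function: standard error of the mean
se <- function(x) sd(x) / sqrt(length(x))

se(c(2, 4, 6, 8))   # 1.290994
```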

RMarkdown

BASICS of R

  • Commands can be submitted through the terminal, console or scripts
  • In your scripts, anything that follows '#' symbol (aka hash) is just for humans
  • Notice on these slides I'm evaluating the code chunks and showing output
  • The output is shown here after the two # symbols, and the number in [] is the index of the first element on that line of output
  • Also notice that R follows the normal order of operations in mathematical evaluation
4*4
## [1] 16
(4+3*2^2)
## [1] 16

Assigning Variables

  • A better way to do this is to assign variables
  • Variables are assigned values using the <- operator.
  • Variable names must begin with a letter, but other than that, just about anything goes.
  • Do keep in mind that R is case sensitive.

Assigning Variables

x <- 2
x * 3
## [1] 6
y <- x * 3
y - 2
## [1] 4

These do not work

3y <- 3    # Error: a variable name cannot begin with a number
3*y <- 3   # Error: the left side of <- must be a variable name, not an expression

Arithmetic operations on functions

  • Arithmetic operations can be performed easily on variables as well as numbers.
  • Try the following, and then your own.
x+2
x^2
log(x)
  • Note that the last of these - log - is a built-in function of R, and therefore the object of the function needs to be put in parentheses
  • These parentheses will be important, and we'll come back to them later when we add arguments after the object in the parentheses
  • The outcome of calculations can be assigned to new variables as well, and the results can be checked using the 'print' command

Arithmetic operations on functions

y <- 67
print(y)
## [1] 67
x <- 124
z <- (x*y)^2
print(z)
## [1] 69022864

STRINGS

  • Variables and operations can be performed on characters as well
  • Note that characters need to be set off by quotation marks to differentiate them from numbers
  • The c() function stands for 'combine' (often read as concatenate)
  • Note that we are using the same variable names as we did previously, which means that we're overwriting our previous assignment
  • A good rule of thumb is to use new names for each variable, and make them short but still descriptive

STRINGS

x <- "I Love"
print (x)
## [1] "I Love"
y <- "Biostatistics"
print (y)
## [1] "Biostatistics"
z <- c(x,y)
print (z)
## [1] "I Love"        "Biostatistics"

FACTORS

  • The variable z is now what is called a character vector.
  • Sometimes we would like to treat the characters as if they were units for subsequent calculations.
  • These are called factors, and we can redefine our character variables as factors.
  • This might seem a bit strange, but it’s important for statistical analyses where we might want to see the mean or variance for two different treatments.

FACTORS

z_factor <- as.factor(z)
print (z_factor)
  • Note that factor levels are reported alphabetically
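If alphabetical ordering isn't what you want, the order can be set explicitly with the levels argument (a quick sketch):

```r
z <- c("I Love", "Biostatistics")

# Default: levels are sorted alphabetically
levels(as.factor(z))                              # "Biostatistics" "I Love"

# Explicit ordering overrides the alphabetical default
levels(factor(z, levels = c("I Love", "Biostatistics")))
```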

VECTORS

  • In general R thinks in terms of vectors (ordered sets of character, factor or numerical values), and writing programs with that in mind will simplify most things for any R user.
  • Vectors can be assigned directly using the 'c()' function and then entering the exact values.

VECTORS

x <- c(2,3,4,2,1,2,4,5,10,8,9)
print(x)
##  [1]  2  3  4  2  1  2  4  5 10  8  9

Basic Statistics

  • Many functions exist to operate on vectors.
  • Combine these with your previous variable to see what happens.
  • Also, try to find other functions (e.g. standard deviation).

Basic Statistics

mean(x)
median(x)
var(x)
log(x)
sd(x)   # note: there is no ln() in R -- log() already gives the natural log
sqrt(x)
sum(x)
length(x)
sample(x, replace = T)
  • Notice that the last function (sample) has an argument (replace=T)
  • Arguments simply modify or direct the function in some way
  • There are many arguments for each function, some of which are defaults

Getting Help

  • Getting Help on any function is very easy - just type a question mark and the name of the function.
  • There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
  • In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of parameters to send to the function.
  • Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.
  • Help can be accessed a few ways - try them :

Getting Help

- help(mean)
- ?mean
- example(mean)
- help.search("mean")
- apropos("mean")
- args(mean)

Creating vectors

  • Creating a vector of new data by entering it by hand can be a drag
  • However, it is also very easy to use functions such as seq and sample
  • Try the examples below. Can you figure out what the three arguments in the parentheses mean?
  • Try varying the arguments to see what happens.
  • Don't go too crazy with the last one or your computer might slow way down

Creating vectors

seq_1 <- seq(0.0, 10.0, by = 0.1)
print(seq_1)
##   [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3
##  [15]  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7
##  [29]  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1
##  [43]  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
##  [57]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
##  [71]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3
##  [85]  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##  [99]  9.8  9.9 10.0
seq_2 <- seq(10.0, 0.0, by = -0.1)
print(seq_2)
##   [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7
##  [15]  8.6  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3
##  [29]  7.2  7.1  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9
##  [43]  5.8  5.7  5.6  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5
##  [57]  4.4  4.3  4.2  4.1  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1
##  [71]  3.0  2.9  2.8  2.7  2.6  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7
##  [85]  1.6  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
##  [99]  0.2  0.1  0.0

Creating vectors

seq_square <- (seq_2)*(seq_2)
print(seq_square)
##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Creating vectors

seq_square_new <- (seq_2)^2
print(seq_square_new)
##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Drawing samples from distributions

  • Here is a way to create your own data sets that are random samples.
  • Again, play around with the arguments in the parentheses to see what happens.

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
plot(x,y) 

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
plot(xy)

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
hist(x)

Drawing samples from distributions

  • You’ve probably figured out that y from the last example is drawing numbers with equal probability.
  • What if you want to draw from a distribution?
  • Again, play around with the arguments in the parentheses to see what happens.

Drawing samples from distributions

x <-rnorm(1000, 0, 100)
hist(x, xlim = c(-500,500))
curve(50000*dnorm(x, 0, 100), xlim = c(-500,500), add=TRUE, col='Red')

  • dnorm() generates the probability density, which can be plotted using the curve() function
  • Note that the curve is added to the existing plot using add=TRUE

Visualizing Data

  • So far you've been visualizing just the list of output numbers
  • Except for the last example where I snuck in a hist function.
  • You can also visualize all of the variables that you've created using the plot function (as well as a number of more sophisticated plotting functions).
  • Each of these is called a high-level plotting function, which sets the stage
  • Low-level plotting functions will tweak the plots and make them beautiful
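As a small sketch of the distinction (the curves here are arbitrary examples):

```r
x <- seq(0, 10, by = 0.1)
plot(x, sqrt(x), type = "l", col = "blue")   # high-level: opens a new plot
points(x, log(x + 1), col = "red")           # low-level: adds points to it
abline(h = 2, lty = 2)                       # low-level: adds a dashed reference line
```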

Visualizing Data

  • What do you think that each of the arguments means for the plot function?
  • A cool thing about R is that the options for the arguments make sense.
  • Try adjusting an argument and see if it works
  • Note next week we will be exploring the plotting in GGPlot2

Visualizing Data

seq_1 <- seq(0.0, 10.0, by = 0.1) 
plot (seq_1, xlab="space", ylab ="function of space", type = "p", col = "red")

Putting plots in a single figure

  • On the next slide
  • The first line of the script tells R that you are going to create a composite figure that has two rows and two columns. Can you tell how?
  • Now, modify the code to add two more variables and add one more row of two panels.
seq_1 <- seq(0.0, 10.0, by = 0.1)
seq_2 <- seq(10.0, 0.0, by = -0.1)

Putting plots in a single figure

par(mfrow=c(2,2))
plot (seq_1, xlab="time", ylab ="p in population 1", type = "p", col = 'red')
plot (seq_2, xlab="time", ylab ="p in population 2", type = "p", col = 'green')
plot (seq_square, xlab="time", ylab ="p2 in population 2", type = "p", col = 'blue')
plot (seq_square_new, xlab="time", ylab ="p in population 1", type = "l", col = 'yellow')

Example using binomial distribution

  • As above for the normal distribution, data can be generated by being sampled from nearly any distribution and then visualized.
  • Below I’m having you use the hist() function. What does it do?

Example using binomial distribution

  • 10 successes (out of 20 trials) is the most frequent outcome
heads <- rbinom(n=1000, size=20, prob=0.5)
hist(heads)

Example using binomial distribution

  • This kind of statement can be run in one line as well, which is sometimes easier.
hist(rbinom(n=1000, size=20, prob=0.5))

Creating Data Frames in R

  • As you have seen, in R you can generate your own random data set drawn from nearly any distribution very easily.
  • Often we will want to use collected data.
  • Now, let’s make a dummy dataset to get used to dealing with data frames
  • Set up three variables (habitat, temp and elevation) as vectors
habitat <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry", "dry","mixed"))
temp <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
elevation <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)
  • Create a data frame where vectors become columns
mydata <- data.frame(habitat, temp, elevation)
row.names(mydata) <- c("Reedy Lake", "Pearcadale", "Warneet", "Cranbourne", 
                       "Lysterfield", "Red Hill", "Devilbend", "Olinda")
  • Now you have a hand-made data frame with row names

Reading in Data Frames in R

  • A strength of R is being able to import data from an external source
  • Create the same table that you did above in a spreadsheet like Excel
  • Export it to comma separated and tab separated text files for importing into R.
  • The first will read in a comma-delimited file, whereas the second reads a tab-delimited file
  • In both cases the header and row.names arguments indicate that there is a header row and row label column
  • Note that giving the name of the file by itself will have R look in the current working directory, whereas a full path can also be used

Reading in Data Frames in R

YourFile <- read.table('yourfile.csv', header=T, row.names=1, sep=',')
YourFile <- read.table('yourfile.txt', header=T, row.names=1, sep='\t')

Exporting Data Frames in R

write.table(YourFile, "yourfile.csv", quote=F, row.names=T, sep=",")
write.table(YourFile, "yourfile.txt", quote=F, row.names=T, sep="\t")

Indexing in data frames

  • Next up - indexing just a subset of the data
  • This is a very important idea in R, that you can analyze just a subset of the data.
  • For example, you might want to analyze only the rows of the file you made that have the habitat value 'mixed'.
print (YourFile[,2])
print (YourFile$temp)
print (YourFile[2,])
plot (YourFile$temp, YourFile$elevation)

Indexing in data frames

  • You can perform operations on particular levels of a factor
  • For example, calculating the mean temperature for each level of habitat ('dry', 'mixed' and 'wet').
  • Note that the first argument is the numerical column vector, and the second is the factor column vector.
  • The third is the operation to apply. Note that reversing the first two arguments does not work.
tapply(YourFile$temp, YourFile$habitat, mean)
tapply(YourFile$temp, YourFile$habitat, var)

R INTERLUDE

Some real transcriptomic data

  • Examine the text file: GacuRNAseq_Subset.csv
  • How many rows and columns are there?
  • How many different variables are there?
  • What are the general types of variables?
  • Now let’s read the data file into R and analyze it
  • This exercise will help you get used to reading in and manipulating genomic data files
  • First off, remember to set your working directory to find your file correctly

Some real transcriptomic data

RNAseq_Data <- read.table('GacuRNAseq_Subset.csv', header=TRUE, sep=',')

print (RNAseq_Data)
head (RNAseq_Data)
tail (RNAseq_Data)

print (RNAseq_Data[,2])
print (RNAseq_Data[1,])
print (RNAseq_Data[1,2])
print (RNAseq_Data$ENSGACG00000000010)
print (RNAseq_Data$ENSGACG00000000010>45.0)

Summary stats and figures

summary1 <- summary(RNAseq_Data$ENSGACG00000000003)
print (summary1)

hist(RNAseq_Data$ENSGACG00000000003)
boxplot(RNAseq_Data$ENSGACG00000000003)
boxplot(RNAseq_Data$ENSGACG00000000003~RNAseq_Data$Population)
plot(RNAseq_Data$ENSGACG00000000003, RNAseq_Data$ENSGACG00000000010)

boxplot(RNAseq_Data$ENSGACG00000000003~RNAseq_Data$Treatment, 
        col = "red", ylab = "Expression Level", xlab = "Treatment level", 
        border ="orange", 
        main = "Boxplot of variation in gene expression across microbiota treatments")

Data wrangling and exploratory data analysis (EDA)

Tidyverse family of packages

Tidyverse family of packages

  • Hadley Wickham and others have written R packages to modify data

  • These packages do many of the same things as base functions in R

  • However, they are specifically designed to do them faster and more easily

  • Wickham also wrote the package ggplot2 for elegant graphics creation

  • GG stands for ‘Grammar of Graphics’

Example of a tibble

Example of a tibble

Types of vectors of data

Types of vectors of data

  • int stands for integers
  • dbl stands for doubles, or real numbers
  • chr stands for character vectors, or strings
  • dttm stands for date-times (a date + a time)
  • lgl stands for logical, vectors that contain only TRUE or FALSE
  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values
  • date stands for dates

Types of vectors of data

  • Logical vectors can take only three possible values:
    • FALSE
    • TRUE
    • NA which is 'not available'.
  • Integer and double vectors are known collectively as numeric vectors.
    • In R numbers are doubles by default.
  • Integers have one special value: NA, while doubles have four:
    • NA
    • NaN which is 'not a number'
    • Inf
    • -Inf
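These special values can be explored directly at the console (a quick sketch):

```r
typeof(1)     # "double"  -- numbers are doubles by default
typeof(1L)    # "integer" -- the L suffix makes an integer

vals <- c(NA, NaN, Inf, -Inf)
is.na(vals)   # TRUE TRUE FALSE FALSE -- NaN also counts as missing
is.nan(vals)  # FALSE TRUE FALSE FALSE -- only NaN is 'not a number'
1 / 0         # Inf
0 / 0         # NaN
```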

Types of vectors of data

  • R will also implicitly coerce the length of vectors
    • This is called vector recycling
    • The shorter vector is repeated, or recycled, until it matches the length of the longer vector
    • This is silent except when the length of the longer vector is not an integer multiple of the length of the shorter, in which case you’ll get a warning
  • The vectorised functions in the tidyverse will throw errors when you recycle anything other than a scalar
    • If you do want to recycle something other than a scalar
    • You’ll need to do it yourself with rep()
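A quick sketch of recycling in base R:

```r
# The shorter vector (length 2) is silently recycled across the longer (length 6)
c(1, 2, 3, 4, 5, 6) + c(10, 20)                  # 11 22 13 24 15 26

# Being explicit with rep() gives the same result without relying on recycling
c(1, 2, 3, 4, 5, 6) + rep(c(10, 20), times = 3)

# Lengths 3 and 2 would still recycle, but with a warning:
# c(1, 2, 3) + c(10, 20)
```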

Key functions in dplyr for vectors

  • Pick observations by their values with filter().
  • Reorder the rows with arrange().
  • Pick variables by their names with select().
  • Create new variables with functions of existing variables with mutate().
  • Collapse many values down to a single summary with summarise().

filter(), arrange() & select()

filter(flights, month == 11 | month == 12)
arrange(flights, year, month, day)
select(flights, year, month, day)

mutate() & transmute()

This function will add a new variable that is a function of other variable(s)

mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

This function will keep only the new variables, dropping the old ones

transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

group_by() & summarise()

This first function allows you to aggregate data by values of categorical variables

by_day <- group_by(flights, year, month, day)

Once you have done this aggregation, you can then calculate values (in this case the mean) of other variables split by the new aggregated levels of the categorical variable

summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
  • Note - you can get a lot of missing values!
  • That’s because aggregation functions obey the usual rule of missing values:
    • if there’s any missing value in the input, the output will be a missing value.
    • fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation
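The rule in miniature (toy numbers rather than the flights data):

```r
x <- c(1, 2, NA, 4)
mean(x)                # NA -- one missing value poisons the whole summary
mean(x, na.rm = TRUE)  # 2.333333 -- drop the missing values first
```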

R INTERLUDE

Playing with Tidyverse functions

  • Examine my R script 'TidyVerse.R'
  • Step 1 - Read in the GacuRNAseq_Subset.csv dataset
  • Step 2 - Make the dataset into a tibble
  • Step 3 - Select all of the categorical variables and only 4 of the gene count variables and put them into a new variable
  • Step 4 - Mutate each of the 4 gene expression values by performing a square root transformation making a new variable for each of the original (keep all 8 in the dataset).
  • Step 5 - Summarize the mean and standard deviation for each of the gene count variables grouped by the ‘sex’ and ‘population’ and ‘treatment’ categorical variables
  • Step 6 - Create a histogram for one of the original gene expression variables, and one of the derived variables
  • Step 7 - Create a box plot for one of the original gene expression variables, and one of the derived variables, split by treatment
  • Step 8 - Write the final data table to a .csv file and one of the figures to a .pdf file

Git and GitHub

Git and GitHub

Clone the repository

  • First make a new directory into which you will clone our course repository
  • Open the terminal and navigate to the directory and type the following
git clone https://github.com/wcresko/UO_ABS.git
  • Now to update the repository you just need to use these commands
git fetch

git merge origin/master
  • The first command downloads any new commits from GitHub
  • The second merges them into your local copy (git pull does both steps at once)